In [1]:
import sys
sys.path.append('..')
from twords.twords import Twords
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
# this pandas option makes the dataframe display the full text of each tweet; useful for seeing entire tweets
# (on newer pandas versions, use None instead of -1)
pd.set_option('display.max_colwidth', -1)
In [ ]:
twit_mars = Twords()
# set path to folder that contains jar files for twitter search
twit_mars.jar_folder_path = "../jar_files_and_background/"
The create_java_tweets function collects tweets and puts them into a single folder in the form needed to read them into a Twords object with get_java_tweets_from_csv_list.
For more information on the create_java_tweets arguments, see the source code in the twords.py file. The arguments are:
total_num_tweets: (int) total number of tweets to collect
tweets_per_run: (int) number of tweets per call to the java tweet collector; from experience it is best to keep this around 10,000 for large runs (for runs of fewer than 10,000 tweets, tweets_per_run can simply be set to the same value as total_num_tweets)
querysearch: (string) search query - for example, "charisma" or "mars rover"; a space between words implies an "and" operator between them: only tweets with both terms will be returned
final_until: (string) the date to search backward in time from, in the form '2015-07-31'; tweets are collected backward in time starting from that date. If left as None, the current date is used
output_folder: (string) name of folder to put output files in
decay_factor: (int) how quickly to wind down the tweet search if errors occur and no tweets are found in a run; a failed run counts as tweets_per_run/decay_factor tweets found, so the higher the factor, the longer the program will keep searching even if runs return nothing
all_tweets: (bool) whether to return "all tweets" (as defined on the Twitter website) or only "top tweets"; the details behind these designations are mysteries only Twitter knows, but experimentation on the website suggests "top tweets" are a subset of "all tweets" that Twitter considers interesting. There is no guarantee that literally every matching tweet is returned - even "all tweets" does not appear to return every single tweet that a given search query matches
Try collecting tweets about the mars rover:
In [ ]:
twit_mars.create_java_tweets(total_num_tweets=100, tweets_per_run=50, querysearch="mars rover",
final_until=None, output_folder="mars_rover",
decay_factor=4, all_tweets=True)
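For example, to collect tweets from before a specific date instead of working backward from today, final_until can be set explicitly (a sketch using the same call signature as above; the output folder name here is just illustrative):
In [ ]:
# collect a small batch of "mars rover" tweets posted before July 31, 2015
twit_mars.create_java_tweets(total_num_tweets=100, tweets_per_run=50, querysearch="mars rover",
                             final_until='2015-07-31', output_folder="mars_rover_2015",
                             decay_factor=4, all_tweets=True)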
In [ ]:
twit_mars.get_java_tweets_from_csv_list()
In [ ]:
twit_mars.tweets_df.head(5)
The get_all_user_tweets function collects all of a user's tweets that are available from the Twitter website by scrolling. As an example, one run of this function collected about 87% of the tweets from the user barackobama.
To avoid problems with scrolling on the website (which is what the java tweet collector does programmatically), it is best to set tweets_per_run to around 500.
This function may sometimes return multiple copies of the same tweet; these duplicates can be removed from the resulting pandas dataframe once the data is read into Twords (a short deduplication example follows the read-in cell below).
user: (string) twitter handle of user to gather tweets from
tweets_per_run: (int) number of tweets to collect in a single call to the java tweet collector; some experimentation is required to see which value drops the fewest tweets - 500 seems to be a decent choice
In [ ]:
twit = Twords()
twit.jar_folder_path = "../jar_files_and_background/"
twit.get_all_user_tweets("barackobama", tweets_per_run=500)
In [2]:
twit = Twords()
twit.data_path = "barackobama"
twit.get_java_tweets_from_csv_list()
twit.convert_tweet_dates_to_standard()
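Since get_all_user_tweets can occasionally return multiple copies of the same tweet (as noted above), one option is to drop any duplicates right after reading the data in, using the same drop_duplicate_tweets method that appears in the cleaning pipeline below:
In [ ]:
# drop any duplicate tweets the scrolling collection may have picked up
twit.drop_duplicate_tweets()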
If you want to sort the tweets by retweets or favorites, you'll need to convert the retweets and favorites columns from unicode into integers:
In [3]:
twit.tweets_df["retweets"] = twit.tweets_df["retweets"].map(int)
twit.tweets_df["favorites"] = twit.tweets_df["favorites"].map(int)
In [4]:
twit.tweets_df.sort_values("favorites", ascending=False)[:5]
Out[4]:
In [5]:
twit.tweets_df.sort_values("retweets", ascending=False)[:5]
Out[5]:
The Twords word frequency analysis can also be applied to these tweets. In this case there was no search term.
In [6]:
twit.background_path = '../jar_files_and_background/freq_table_72319443_total_words_twitter_corpus.csv'
twit.create_Background_dict()
twit.create_Stop_words()
In [7]:
twit.keep_column_of_original_tweets()
twit.lower_tweets()
twit.keep_only_unicode_tweet_text()
twit.remove_urls_from_tweets()
twit.remove_punctuation_from_tweets()
twit.drop_non_ascii_characters_from_tweets()
twit.drop_duplicate_tweets()
twit.convert_tweet_dates_to_standard()
twit.sort_tweets_by_date()
Make word frequency dataframe:
In [8]:
twit.create_word_bag()
twit.make_nltk_object_from_word_bag()
twit.create_word_freq_df(10000)
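The resulting word_freq_df includes a "log relative frequency" column that, roughly speaking, compares how often a word appears in these tweets with how often it appears in the background Twitter corpus. The sketch below illustrates that idea; it is an assumption inferred from the column names, so see create_word_freq_df in twords.py for the exact definition:
In [ ]:
import numpy as np

def log_relative_frequency(tweet_count, total_tweet_words, background_count, total_background_words):
    # rate of the word among the collected tweets versus its rate in the background corpus,
    # on a log scale: positive values mean the account uses the word more often than usual
    tweet_rate = tweet_count / float(total_tweet_words)
    background_rate = background_count / float(total_background_words)
    return np.log(tweet_rate / background_rate)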
In [9]:
twit.word_freq_df.sort_values("log relative frequency", ascending=False, inplace=True)
twit.word_freq_df.head(20)
Out[9]:
In [10]:
twit.tweets_containing("sotu")[:10]
Out[10]:
Now plot the relative frequency results. We see from word_freq_df that the terms with the largest relative frequency are specialized ones like "sotu" (State of the Union) and specific policy-related words like "middle-class." We'll raise the required number of background occurrences to filter out these policy-specific words and get at more general words that the president's twitter account nevertheless uses more often than usual:
In [11]:
num_words_to_plot = 32
background_cutoff = 100
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.gca();
ax.xaxis.grid(linewidth=4);
In [12]:
num_words_to_plot = 50
background_cutoff = 1000
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.gca();
ax.xaxis.grid(linewidth=4);
The month of January appears to carry special import with the president's twitter account.
In [13]:
num_words_to_plot = 32
background_cutoff = 5000
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=True)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.gca();
ax.xaxis.grid(linewidth=4);
In [14]:
num_words_to_plot = 32
background_cutoff = 5000
(twit.word_freq_df[twit.word_freq_df['background occurrences'] > background_cutoff]
     .sort_values("log relative frequency", ascending=False)
     .set_index("word")["log relative frequency"][-num_words_to_plot:]
     .plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c"));
plt.title("log relative frequency", fontsize=30);
ax = plt.gca();
ax.xaxis.grid(linewidth=4);
The presidency is no place for posting hate or androids.
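Since the same plotting snippet was repeated above with different cutoffs, it can be convenient to wrap it in a small helper (a sketch that reuses the same column names and plotting calls as the cells above):
In [ ]:
def plot_log_relative_frequency(twords_obj, background_cutoff, num_words_to_plot, ascending=True):
    # keep only words common enough in the background corpus, then plot the words
    # with the highest (or, with ascending=False, the lowest) log relative frequency
    df = twords_obj.word_freq_df
    series = (df[df['background occurrences'] > background_cutoff]
                .sort_values("log relative frequency", ascending=ascending)
                .set_index("word")["log relative frequency"][-num_words_to_plot:])
    series.plot.barh(figsize=(20, num_words_to_plot/2.), fontsize=30, color="c")
    plt.title("log relative frequency", fontsize=30)
    plt.gca().xaxis.grid(linewidth=4)

# for example: plot_log_relative_frequency(twit, background_cutoff=1000, num_words_to_plot=50)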
In [ ]: